
23 research outputs found

Learning workload behaviour models from monitored time-series for resource estimation towards data center optimization

Author: Buchaca Prats David
Publication venue: Universitat Politècnica de Catalunya
Publication date: 14/01/2021

In recent years there has been extraordinary growth in the demand for Cloud Computing resources executed in Data Centers. Modern Data Centers are complex systems that need management. As distributed computing systems grow, and workloads benefit from such computing environments, the management of such systems increases in complexity. The complexity of resource usage and power consumption in cloud-based applications makes it difficult to understand application behavior through expert examination. The difficulty increases when applications are seen as "black boxes", where only external monitoring can be retrieved. Furthermore, given the variety of scenarios and applications, automation is required. To deal with such complexity, Machine Learning methods become crucial to facilitate tasks that can be automatically learned from data. Firstly, this thesis proposes an unsupervised learning technique to learn high-level representations from workload traces. This technique provides a fast methodology to characterize workloads as sequences of abstract phases. The learned phase representation is validated on a variety of datasets and used in an auto-scaling task, where we show that it can be applied in a production environment, achieving better performance than other state-of-the-art techniques. Secondly, this thesis proposes a neural architecture, based on Sequence-to-Sequence models, that provides the expected resource usage of applications sharing hardware resources. The proposed technique gives resource managers the ability to predict resource usage over time as well as the completion time of the running applications. The technique yields lower error when predicting usage than other popular Machine Learning methods. Thirdly, this thesis proposes a technique for auto-tuning Big Data workloads from the available tunable parameters.
The proposed technique gathers information from the logs of an application, generating a feature descriptor that captures relevant information from the application to be tuned. Using this information we demonstrate that performance models can generalize up to 34% better than other state-of-the-art solutions. Moreover, the search time to find a suitable solution can be drastically reduced, with up to a 12x speedup and results of almost equal quality to modern solutions. These results prove that modern learning algorithms, with the right feature information, provide powerful techniques to manage resource allocation for applications running in cloud environments. This thesis demonstrates that learning algorithms allow relevant optimizations in Data Center environments, where applications are externally monitored and careful resource management is paramount to efficiently use computing resources. We propose to demonstrate this thesis in three areas that orbit around resource management in server environments.

Modern Data Centers are complex systems that need to be managed. As distributed computing systems grow and applications benefit from these infrastructures, their complexity also increases. The complexity involved in managing compute and energy resources in cloud computing systems makes it difficult to understand manually the behavior of the applications being executed. This difficulty increases when applications are observed as "black boxes", where only some metrics of the boxes can be monitored externally. Moreover, given the wide variety of scenarios and applications, the management of these resources needs to be automated. To meet this challenge, machine learning plays a key role in facilitating these tasks, which can be learned automatically from previous data of the monitored system. This thesis demonstrates that learning algorithms can bring very relevant optimizations to Data Center management, where applications are externally monitored and resource management is vitally important for making efficient use of the computing capacity of these systems. First, this thesis proposes using unsupervised learning to learn high-level representations from application traces. This technique provides a fast methodology to characterize applications seen as sequences of abstract phases. The learned phase representation is validated on different datasets and applied to the management of auto-scaling tasks, where it is concluded that it can be applicable in a production environment, achieving better performance than other state-of-the-art methods. Second, this thesis proposes the use of neural networks, based on Sequence-to-Sequence architectures, that provide an estimate of the resources used by applications that share hardware resources. The proposed technique gives resource managers the ability to predict resource usage over time, as well as an estimate of the compute time of the applications. It also reduces the error in resource estimation compared with other popular machine learning techniques. Finally, this thesis introduces a technique for auto-tuning the hyper-parameters of Big Data applications. It consists of obtaining information from the application logs, generating a feature vector that captures relevant information about the applications to be tuned. Using this information, it is validated that regressors trained to predict application performance are able to generalize up to 34% better than other state-of-the-art regressors. In addition, the search time to find a good solution can be drastically reduced, achieving an improvement of up to 12 times over the quality results of modern alternatives. These results show that modern machine learning algorithms become very powerful techniques for managing resource allocation for applications running in the cloud.

Computer architecture

UPCommons. Portal del coneixement obert de la UPC

Tesis Doctorals en Xarxa

Weighted Contrastive Divergence

Author: Mazzanti Castrillejo Ferran
Romero Merino Enrique
Delgado Pin Jordi
Buchaca Prats David
Publication date: 12/07/2018

Learning algorithms for energy-based Boltzmann architectures that rely on gradient descent are in general computationally prohibitive, typically due to the exponential number of terms involved in computing the partition function. As a result, one has to resort to approximation schemes for the evaluation of the gradient. This is the case of Restricted Boltzmann Machines (RBM) and their learning algorithm, Contrastive Divergence (CD). It is well known that CD has a number of shortcomings, and its approximation to the gradient has several drawbacks. Overcoming these defects has been the basis of much research, and new algorithms have been devised, such as persistent CD. In this manuscript we propose a new algorithm that we call Weighted CD (WCD), built from small modifications of the negative phase in standard CD. However small these modifications may be, experimental work reported in this paper suggests that WCD provides a significant improvement over standard CD and persistent CD at a small additional computational cost.

arXiv.org e-Print Archive

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Sequence-to-sequence models for workload interference prediction on batch processing datacenters

Author: Berral García Josep Lluís
Buchaca Prats David
Carrera Pérez David
Marcual Medina Joan
Publication venue: 'Elsevier BV'
Publication date: 06/07/2020

Co-scheduling of jobs in data centers is a challenging scenario where jobs can compete for resources, leading to severe slowdowns or failed executions. Efficient job placement in environments where resources are shared requires awareness of how jobs interfere during execution, to go far beyond ineffective resource overbooking techniques. Current techniques, most of which already involve machine learning and job modeling, are based on workload behavior summarization over time, rather than focusing on effective job requirements at each instant of the execution. In this work, we propose a methodology for modeling co-scheduling of jobs in data centers, based on their behavior towards resources and execution time and using sequence-to-sequence models based on recurrent neural networks. The goal is to forecast the footprint of co-executed jobs on resources throughout their execution time, from the profile shown by the individual jobs, in order to enhance resource manager and scheduler placement decisions. The methods presented herein are validated by using High Performance Computing benchmarks based on different frameworks (such as Hadoop and Spark) and applications (CPU bound, IO bound, machine learning, SQL queries...). Experiments show that the model can correctly identify the resource usage trends from previously seen and even unseen co-scheduled jobs. This work is supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement no. 639595); the Generalitat de Catalunya under contract 2014SGR1051; the ICREA Academia program; the BSC-CNS Severo Ochoa program (SEV-2015-0493); and the Spanish Ministry of Economy under contract TIN2015-65316-P. Peer reviewed. Postprint (author's final draft).

arXiv.org e-Print Archive

UPCommons. Portal del coneixement obert de la UPC

Automatic generation of workload profiles using unsupervised learning pipelines

Author: Berral García Josep Lluís
Buchaca Prats David
Carrera Pérez David
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2017

The complexity of resource usage and power consumption in cloud-based applications makes it difficult to understand application behavior through expert examination. The difficulty increases when applications are seen as “black boxes”, where only external monitoring can be retrieved. Furthermore, given the variety of scenarios and applications, automation is required. Here we examine and model application behavior by finding behavior phases. We use Conditional Restricted Boltzmann Machines (CRBM) to model time-series containing resource trace measurements such as CPU, Memory and IO. CRBMs can be used to map a given historic window of trace behaviour into a single vector. This low-dimensional and time-aware vector can be passed through clustering methods, from simplistic ones like k-means to more complex ones like those based on Hidden Markov Models (HMM). We use these methods to find phases of similar behaviour in the workloads. Our experimental evaluation shows that the proposed method is able to identify different phases of resource consumption across different workloads. We show that the distinct phases contain specific resource patterns that distinguish them. Peer reviewed. Postprint (published version).
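
The window-then-cluster idea in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: a plain flattened sliding window stands in for the CRBM embedding, a small hand-written k-means stands in for the clustering stage, and the trace is synthetic. The names `sliding_windows`, `kmeans`, and the two-phase trace are assumptions made for illustration only.

```python
import random

def sliding_windows(trace, width):
    """Flatten each run of `width` consecutive samples into one vector.
    Stand-in for the CRBM, which would learn this mapping instead."""
    return [
        [value for sample in trace[i:i + width] for value in sample]
        for i in range(len(trace) - width + 1)
    ]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every window vector.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Synthetic (cpu, memory) trace: an idle phase followed by a busy phase.
trace = [(0.1, 0.2)] * 10 + [(0.9, 0.8)] * 10
windows = sliding_windows(trace, width=3)
phases = kmeans(windows, k=2)
print(phases)
```

Each label in `phases` marks the behavior phase assigned to one window, so a change of label along the list marks a phase transition in the trace.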

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Stopping criteria in contrastive divergence: Alternatives to the reconstruction error

Author: Buchaca Prats David
Delgado Pin Jordi
Mazzanti Castrillejo Fernando Pablo
Romero Merino Enrique
Publication date: 01/01/2014

Restricted Boltzmann Machines (RBMs) are general unsupervised learning devices to ascertain generative models of data distributions. RBMs are often trained using the Contrastive Divergence learning algorithm (CD), an approximation to the gradient of the data log-likelihood. A simple reconstruction error is often used to decide whether the approximation provided by the CD algorithm is good enough, though several authors (Schulz et al., 2010; Fischer & Igel, 2010) have raised doubts concerning the feasibility of this procedure. However, not many alternatives to the reconstruction error have been used in the literature. In this manuscript we investigate simple alternatives to the reconstruction error in order to detect as soon as possible the decrease in the log-likelihood during learning. Peer reviewed. Postprint (published version).
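
For readers unfamiliar with the quantities named in this abstract, the following is a minimal sketch of one-step Contrastive Divergence (CD-1) on a tiny binary RBM, recording the reconstruction error after each epoch. It illustrates the standard CD-1 update only, not the alternative stopping criteria studied in the paper. The layer sizes, learning rate, toy data, and function names are all assumptions chosen for illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample(probs, rng):
    """Draw a binary vector from independent Bernoulli probabilities."""
    return [1.0 if rng.random() < p else 0.0 for p in probs]

def hidden_probs(v, W, c):
    return [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(c))]

def visible_probs(h, W, b):
    return [sigmoid(b[i] + sum(h[j] * W[i][j] for j in range(len(h))))
            for i in range(len(b))]

def cd1_epoch(data, W, b, c, rng, lr=0.1):
    """One pass of CD-1 over `data`; returns the mean reconstruction error."""
    total_error = 0.0
    for v0 in data:
        ph0 = hidden_probs(v0, W, c)      # positive phase
        h0 = sample(ph0, rng)
        pv1 = visible_probs(h0, W, b)     # one Gibbs step: reconstruction
        v1 = sample(pv1, rng)
        ph1 = hidden_probs(v1, W, c)      # negative phase
        for i in range(len(b)):
            for j in range(len(c)):
                W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
            b[i] += lr * (v0[i] - v1[i])
        for j in range(len(c)):
            c[j] += lr * (ph0[j] - ph1[j])
        # Squared distance between the input and its reconstruction.
        total_error += sum((x - y) ** 2 for x, y in zip(v0, pv1))
    return total_error / len(data)

rng = random.Random(0)
n_visible, n_hidden = 4, 2  # illustrative sizes
W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)] for _ in range(n_visible)]
b = [0.0] * n_visible
c = [0.0] * n_hidden
data = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]  # toy patterns
errors = [cd1_epoch(data, W, b, c, rng) for _ in range(50)]
print(errors[0], errors[-1])
```

The `errors` list is the signal that the abstract calls into question: it measures how well one Gibbs step reproduces the input, which is not the same thing as the data log-likelihood.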

UPCommons. Portal del coneixement obert de la UPC

Theta-Scan: Leveraging behavior-driven forecasting for vertical auto-scaling in container cloud

Author: Berral García Josep Lluís
Buchaca Prats David
Herron Mulet Claudia
Wang Chen
Youssef Alaa
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2021

Detection of behavior patterns in the resource usage of containerized Cloud applications is necessary for proper resource provisioning. Applications can use CPU/Memory with repetitive patterns, following a trend over time independently. By identifying such patterns, resource forecasting models can be fit better, reducing over/under-provisioning via fewer resizing operations. Here we present ThetaScan, a time-series analysis method for vertical auto-scaling of containers in the Cloud, based on the detection of stationarity/trending and periodicity in resource consumption. Our method leverages the Theta Forecaster algorithm with deseasonalization that, in our provisioning scenario, only requires the estimated periodicity of resource consumption as its principal hyper-parameter. Commonly used behavior detection methods require manual hyper-parameter tuning, making them infeasible for automation. Besides, ThetaScan can be used at multiple scales (minute/hour/day), detecting hourly and daily patterns to improve resource usage prediction. Experiments show that we can detect behaviors in resource consumption that common methods miss, without requiring extensive manual tuning. Compared to fixed-size scheduling, we reduce resizing triggers by around 10% to 15% and reduce over-provisioning of CPU and Memory through periodic-based provisioning. We also report about 60% on multi-scale resource forecasting, with respect to single-scale, for traces showing periodicity at different levels. This work has been partially supported by the Spanish Government (contract PID2019-107255GB) and by Generalitat de Catalunya (contract 2014-SGR-1051). Peer reviewed. Postprint (author's final draft).
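
Since the abstract names the estimated periodicity as the principal hyper-parameter, a small sketch of one common way to estimate a period may help. It picks the strongest peak of the sample autocorrelation. This is a generic technique swapped in for illustration, not ThetaScan's own detection procedure, and the synthetic hourly trace and the function names are assumptions.

```python
import math

def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0:
        return 0.0
    cov = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return cov / var

def estimate_period(series, max_lag=None):
    """Return the lag of the highest local maximum of the autocorrelation,
    or None when no peak exists (for example, a flat series)."""
    max_lag = max_lag or len(series) // 2
    acf = [autocorrelation(series, lag) for lag in range(max_lag + 2)]
    peaks = [lag for lag in range(1, max_lag + 1)
             if acf[lag - 1] < acf[lag] >= acf[lag + 1]]
    return max(peaks, key=lambda lag: acf[lag]) if peaks else None

# Synthetic CPU usage sampled hourly with a daily (24-sample) cycle.
usage = [50 + 20 * math.sin(2 * math.pi * t / 24) for t in range(240)]
period = estimate_period(usage)
print(period)
```

The estimated period could then be passed to a forecaster that deseasonalizes the series before fitting, which is the role the abstract assigns to it.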

UPCommons. Portal del coneixement obert de la UPC

A highly parameterizable framework for Conditional Restricted Boltzmann Machine based workloads accelerated with FPGAs and OpenCL

Author: Berral García Josep Lluís
Buchaca Prats David
Cadenelli Nicola
Carrera Pérez David
Jaksic Zoran
Polo Bardés Jordà
Publication venue: 'Elsevier BV'
Publication date: 01/03/2020

© 2020 Elsevier. This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/
Conditional Restricted Boltzmann Machine (CRBM) is a promising candidate for multidimensional system modeling that can learn a probability distribution over a set of data. It is a specific type of artificial neural network with one input (visible) and one output (hidden) layer. Recently published works demonstrate that CRBM is a suitable mechanism for modeling multidimensional time series such as human motion, workload characterization, and city traffic analysis. The process of learning and inference of these systems relies on linear algebra functions like matrix–matrix multiplication, and for larger data sets, they are very compute-intensive. In this paper, we present a configurable framework for CRBM-based workloads for arbitrarily large models. We show how to accelerate the learning process of CRBM with FPGAs and OpenCL, and we conduct an extensive scalability study for different model sizes and system configurations. We show significant improvement in performance/Watt for large models and batch sizes (from 1.51x up to 5.71x depending on the host configuration) when we use FPGA and OpenCL for the acceleration, and limited benefits for small models compared to the state-of-the-art CPU solution. This work was supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 639595); the Ministry of Economy of Spain under contract TIN2015-65316-P and Generalitat de Catalunya, Spain under contract 2014SGR1051; the ICREA Academia program, Spain; the BSC-CNS Severo Ochoa program, Spain (SEV-2015-0493); and Intel Corporation, United States. Peer reviewed. Postprint (published version).

UPCommons. Portal del coneixement obert de la UPC


A multilayer extension of the similarity neural network

Author: Buchaca Prats David
Publication venue: Universitat Politècnica de Catalunya

This project brings together ideas from radial basis functions and the multilayer perceptron to develop another artificial neural network architecture and a method to train it. It is an extension of the similarity neural network of Lluís Belanche.

RECERCAT

Automatic Generation of Workload Profiles Using Unsupervised Learning Pipelines

Author: Berral Josep Ll.
Buchaca Prats David
Carrera David
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/03/2018

The complexity of resource usage and power consumption in cloud-based applications makes it difficult to understand application behavior through expert examination. The difficulty increases when applications are seen as “black boxes,” where only external monitoring can be retrieved. Furthermore, given the variety of scenarios and applications, automation is required. Here, we examine and model application behavior by finding behavior phases. We use conditional restricted Boltzmann machines (CRBMs) to model time-series containing resource trace measurements such as CPU, memory, and IO. CRBMs can be used to map a given historic window of trace behavior into a single vector. This low-dimensional and time-aware vector can be passed through clustering methods, from simplistic ones like k-means to more complex ones like those based on hidden Markov models. We use these methods to find phases of similar behavior in the workloads. Our experimental evaluation shows that the proposed method is able to identify different phases of resource consumption across different workloads. We show that the distinct phases contain specific resource patterns that distinguish them. This project received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 639595). It is also partially supported by the Ministry of Economy of Spain under contract TIN2015-65316-P and Generalitat de Catalunya under contract 2014SGR1051, by the ICREA Academia program, and by the BSC-CNS Severo Ochoa program (SEV-2015-0493). Peer reviewed.